Red Wine Quality Exploration by Shinichiro Tanaka

Univariate Plots Section

First I explore the basic structure of the red wine data set and get its summary.

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

There are 1599 red wine observations in the dataset with 12 features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, quality).

75% of red wines have volatile acidity equal to or less than 0.64 g/dm^3. The minimum value of citric acid is 0.0 g/dm^3 and 25% of red wines have citric acid equal to or less than 0.09 g/dm^3. Residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, and sulphates are equal to or less than 2.6 g/dm^3, 0.09 g/dm^3, 21.0 mg/dm^3, 62.0 mg/dm^3, and 0.73 g/dm^3, respectively, for 75% of red wines. Quality of red wines only takes integer values, which means it can be treated as an ordered variable. The worst, median, and best quality for red wines are 3, 6 and 8, respectively.
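As a minimal sketch of how output like the above can be produced, the snippet below calls str() and summary() on a toy data frame. The toy frame is an assumption for illustration only: it holds the first ten rows of three columns, copied from the str() output above, so the script runs without the original file. The real analysis uses all 1599 rows.

```r
# Toy stand-in for the full data frame (first ten rows, three columns,
# copied from the str() output shown above).
toy <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

str(toy)      # column types and leading values
summary(toy)  # min, quartiles, median, mean, and max per column

# Statements such as "75% of wines are at or below ..." are read off the
# 3rd quartile, which quantile() returns directly.
q3_volatile    <- quantile(toy$volatile.acidity, probs = 0.75)
median_quality <- median(toy$quality)
```

The quartiles of this ten-row sample will not match those of the full data set; the point is only which function yields which number.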

Next I will look at the distributions of all the variables (fixed acidity, volatile acidity, …, and quality) except “X” by plotting histograms to get an overview of what the data set is like.

It is notable that the distributions of fixed acidity and volatile acidity are not symmetrical but both skewed right.

Here I used a small binwidth to better understand the distribution of citric acid. The distribution shows a very large peak at around zero. I wonder if these data are really correct. By using the “count” function, I found that citric acid is zero for 132 (8.3%) of the red wines. Most red wines have citric acid less than 0.75 g/dm^3 but there is an outlier at 1.0 g/dm^3. The calculations are shown below.

## Source: local data frame [2 x 2]
## 
##   citric.acid == 0    n
## 1            FALSE 1467
## 2             TRUE  132
## Source: local data frame [2 x 2]
## 
##   citric.acid == 1    n
## 1            FALSE 1598
## 2             TRUE    1
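The tallies above can be sketched in base R as follows. This is a stand-in rather than the original code, which used the “count” function: table() and sum() give the same kind of TRUE/FALSE tally. The citric_acid vector is the first ten values from the str() output, used here only so the snippet is self-contained.

```r
# First ten citric acid values, copied from the str() output.
citric_acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# TRUE/FALSE tally, analogous to the count output shown above.
zero_counts <- table(citric_acid == 0)
n_zero      <- sum(citric_acid == 0)

# Share of zero-citric-acid wines in the full data set,
# using the counts reported above (132 out of 1599).
share_zero_full <- round(100 * 132 / 1599, 1)
print(zero_counts)
print(share_zero_full)
```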

Resume plotting histograms…

The distribution of residual sugar is also skewed right. I transformed the long-tailed data to better understand it.
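A minimal sketch of handling a long right tail, assuming a log10 transform (the report does not say which transform was applied). It uses the first ten residual sugar values from the str() output and base R's hist() with plot = FALSE, so it only computes the bins instead of drawing them.

```r
# First ten residual sugar values, copied from the str() output.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# Crude indicator of right skew: the mean is pulled above the median.
is_right_skewed <- mean(residual_sugar) > median(residual_sugar)

# Bin the raw and the log10-transformed values (assumed transform).
h_raw <- hist(residual_sugar, plot = FALSE)
h_log <- hist(log10(residual_sugar), plot = FALSE)

print(h_raw$counts)
print(h_log$counts)
```

On the log scale the single large value (6.1) sits much closer to the bulk of the data, which is why such a transform makes the main body of the distribution easier to read.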

Used a small binwidth and changed the upper limit of chlorides to better understand the long-tail distribution. Most red wines have chlorides less than 0.2 g/dm^3 but there are some outliers.

Free and total sulfur dioxide also show right-skewed distributions. Used a small binwidth to better understand the distribution of free sulfur dioxide. Free sulfur dioxide is less than 60 mg/dm^3 for most red wines but there are some outliers. Total sulfur dioxide is less than 150 mg/dm^3 for most red wines but there are some outliers.

Density and pH are almost normally distributed, while sulphates and alcohol are again skewed right. Sulphates is less than 1.5 g/dm^3 for most red wines but there are some outliers. Used a small binwidth to better understand the distribution of alcohol.

Quality can be regarded as an ordered variable ranging from 3 to 8. As shown in the calculations below, 1319 out of 1599 red wines (82%) have an intermediate quality of 5 or 6, while only 28 (1.8%) have a quality of 3 or 8. Taking all these findings into account, I wonder whether a linear model using some of the features can be a good method to predict red wine quality, or whether we need to consider other approaches.

## Source: local data frame [2 x 2]
## 
##   quality == 5   n
## 1        FALSE 918
## 2         TRUE 681
## Source: local data frame [2 x 2]
## 
##   quality == 6   n
## 1        FALSE 961
## 2         TRUE 638
## Source: local data frame [2 x 2]
## 
##   quality == 3    n
## 1        FALSE 1589
## 2         TRUE   10
## Source: local data frame [2 x 2]
## 
##   quality == 8    n
## 1        FALSE 1581
## 2         TRUE   18
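The percentages quoted for the quality counts can be checked with a few lines of arithmetic. The counts are taken directly from the output above; the variable names are only illustrative.

```r
n_total <- 1599

# Counts reported in the output above, keyed by quality score.
n_by_quality <- c("3" = 10, "5" = 681, "6" = 638, "8" = 18)

n_middle  <- n_by_quality[["5"]] + n_by_quality[["6"]]  # quality 5 or 6
n_extreme <- n_by_quality[["3"]] + n_by_quality[["8"]]  # quality 3 or 8

# Whatever remains must have quality 4 or 7.
n_four_or_seven <- n_total - n_middle - n_extreme

pct_middle  <- 100 * n_middle / n_total
pct_extreme <- 100 * n_extreme / n_total
print(c(middle = pct_middle, extreme = pct_extreme))
```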

In addition, I’m interested in the asymmetric distribution of red wine quality (there are more red wines with quality = 7 than those with quality = 4). I wonder if this has anything to do with the right-skewed distributions of fixed acidity, volatile acidity, residual sugar, free sulfur dioxide, total sulfur dioxide, sulphates and alcohol.

Univariate Analysis

What is the structure of your dataset?

There are 1599 red wine observations in the dataset with 12 features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, quality). Quality is an ordered variable with min = 3 and max = 8.

Other observations:
The distributions of fixed acidity, volatile acidity, residual sugar, free sulfur dioxide, total sulfur dioxide, sulphates and alcohol are skewed right.
About 8.3% of red wines have zero citric acid.
There are some outliers in the data of citric acid, chlorides, free and total sulfur dioxide, as well as sulphates.
Most red wines have a quality of 5 or 6 and fewer than 2% have a quality of 3 or 8.
The distribution of quality is asymmetric where there are more red wines with quality = 7 than those with quality = 4.

What is/are the main feature(s) of interest in your dataset?

The main features in the data set are fixed acidity, volatile acidity, residual sugar, free sulfur dioxide, total sulfur dioxide, sulphates, alcohol and quality. I’d like to know which variables contribute most to the quality of red wine. Based on the description of the Red Wine Quality dataset provided by Cortez et al., I suspect volatile acidity, citric acid, residual sugar, and total sulfur dioxide could be the main features for predicting quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I suspect the asymmetric distribution of quality has something to do with the right-skewed feature of variables like volatile acidity or alcohol.

Did you create any new variables from existing variables in the dataset?

No.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

About 8.3% of red wines have zero citric acid. Although it’s not clear whether those data are correct or due to measurement errors, I will consider removing them as outliers in later investigation to see if it helps predict the quality of red wine more clearly.

Bivariate Plots Section

I’ll start the bivariate analysis by calculating correlation coefficients for each pair of variables.

##                                 X fixed.acidity volatile.acidity
## X                     1.000000000   -0.26848392     -0.008815099
## fixed.acidity        -0.268483920    1.00000000     -0.256130895
## volatile.acidity     -0.008815099   -0.25613089      1.000000000
## citric.acid          -0.153551355    0.67170343     -0.552495685
## residual.sugar       -0.031260835    0.11477672      0.001917882
## chlorides            -0.119868519    0.09370519      0.061297772
## free.sulfur.dioxide   0.090479643   -0.15379419     -0.010503827
## total.sulfur.dioxide -0.117849669   -0.11318144      0.076470005
## density              -0.368372087    0.66804729      0.022026232
## pH                    0.136005328   -0.68297819      0.234937294
## sulphates            -0.125306999    0.18300566     -0.260986685
## alcohol               0.245122841   -0.06166827     -0.202288027
## quality               0.066452608    0.12405165     -0.390557780
##                      citric.acid residual.sugar    chlorides
## X                    -0.15355136   -0.031260835 -0.119868519
## fixed.acidity         0.67170343    0.114776724  0.093705186
## volatile.acidity     -0.55249568    0.001917882  0.061297772
## citric.acid           1.00000000    0.143577162  0.203822914
## residual.sugar        0.14357716    1.000000000  0.055609535
## chlorides             0.20382291    0.055609535  1.000000000
## free.sulfur.dioxide  -0.06097813    0.187048995  0.005562147
## total.sulfur.dioxide  0.03553302    0.203027882  0.047400468
## density               0.36494718    0.355283371  0.200632327
## pH                   -0.54190414   -0.085652422 -0.265026131
## sulphates             0.31277004    0.005527121  0.371260481
## alcohol               0.10990325    0.042075437 -0.221140545
## quality               0.22637251    0.013731637 -0.128906560
##                      free.sulfur.dioxide total.sulfur.dioxide     density
## X                            0.090479643          -0.11784967 -0.36837209
## fixed.acidity               -0.153794193          -0.11318144  0.66804729
## volatile.acidity            -0.010503827           0.07647000  0.02202623
## citric.acid                 -0.060978129           0.03553302  0.36494718
## residual.sugar               0.187048995           0.20302788  0.35528337
## chlorides                    0.005562147           0.04740047  0.20063233
## free.sulfur.dioxide          1.000000000           0.66766645 -0.02194583
## total.sulfur.dioxide         0.667666450           1.00000000  0.07126948
## density                     -0.021945831           0.07126948  1.00000000
## pH                           0.070377499          -0.06649456 -0.34169933
## sulphates                    0.051657572           0.04294684  0.14850641
## alcohol                     -0.069408354          -0.20565394 -0.49617977
## quality                     -0.050656057          -0.18510029 -0.17491923
##                               pH    sulphates     alcohol     quality
## X                     0.13600533 -0.125306999  0.24512284  0.06645261
## fixed.acidity        -0.68297819  0.183005664 -0.06166827  0.12405165
## volatile.acidity      0.23493729 -0.260986685 -0.20228803 -0.39055778
## citric.acid          -0.54190414  0.312770044  0.10990325  0.22637251
## residual.sugar       -0.08565242  0.005527121  0.04207544  0.01373164
## chlorides            -0.26502613  0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide   0.07037750  0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide -0.06649456  0.042946836 -0.20565394 -0.18510029
## density              -0.34169933  0.148506412 -0.49617977 -0.17491923
## pH                    1.00000000 -0.196647602  0.20563251 -0.05773139
## sulphates            -0.19664760  1.000000000  0.09359475  0.25139708
## alcohol               0.20563251  0.093594750  1.00000000  0.47616632
## quality              -0.05773139  0.251397079  0.47616632  1.00000000

Although we need to investigate 2D scatter plots to look at the correlation between two variables in detail, it is notable that the following pairs have relatively large correlation coefficients:
quality and volatile acidity (negative)
quality and citric acid, sulphates and alcohol (positive)
density and alcohol (negative)
density and residual sugar (positive)
pH and citric acid (negative)
free sulfur dioxide and total sulfur dioxide (positive)
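A minimal sketch of computing a correlation matrix and listing the pairs whose coefficient is large in absolute value. The toy data frame (first ten rows of four columns from the str() output) and the 0.3 cutoff are assumptions for illustration; the coefficients it yields will differ from the full-data values printed above.

```r
# Toy stand-in: first ten rows of four columns from the str() output.
toy <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

cm <- cor(toy)    # Pearson correlation for every pair of columns
threshold <- 0.3  # illustrative cutoff, not taken from the report

# Look only above the diagonal so each pair is listed once.
big <- which(abs(cm) > threshold & upper.tri(cm), arr.ind = TRUE)
large_pairs <- data.frame(
  var1 = rownames(cm)[big[, "row"]],
  var2 = colnames(cm)[big[, "col"]],
  r    = cm[big]
)
print(large_pairs)
```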

The pair plot below provides an overview of the relationships between the variables in the data set. I omitted the variable “X” and added “smooth” to the plots.

I want to look closer at the plots involving quality and other variables: fixed acidity, volatile acidity, residual sugar, free sulfur dioxide, total sulfur dioxide, sulphates, alcohol, and so on.

The first one is quality vs. fixed acidity.

Since I found a scatter plot doesn’t help much to see the relationship between quality and fixed acidity, I created a box plot instead. I’ll be using the same methods for other variables in later investigations. Here we see a weak trend that red wines with better quality have a larger median fixed acidity, but it is not clear.
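The comparison a box plot makes visible, the median of a variable within each quality level, can also be computed directly. The sketch below uses tapply() on the first ten rows from the str() output; with so few rows the medians are illustrative only and say nothing about the trend in the full data.

```r
# First ten rows of fixed acidity and quality, copied from the str() output.
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
quality       <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Median fixed acidity within each quality level: the centre line that a
# box plot of fixed acidity grouped by quality would draw for each box.
median_by_quality <- tapply(fixed_acidity, quality, median)
print(median_by_quality)
```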

Resume boxplots…

Red wines with better quality have a smaller median volatile acidity. When volatile acidity is larger than 0.8 g/dm^3, the quality of red wines can hardly be 7 or better.

In the second plot above I removed data with citric acid = 0.0. We can see a trend that red wines with better quality have a larger median citric acid.

In the second plot above I limited the range of residual sugar to (0, 8) in order to better understand the change in the median values. No clear relationship between residual sugar and quality.

I modified the range of chlorides in the second plot above. We see a weak trend that red wines with better quality have a smaller median chlorides, but it is not clear.

There is a nonlinear relationship between quality and the median free sulfur dioxide.

Again, there is a nonlinear trend between quality and the median total sulfur dioxide. I’ll look at this in more detail afterward.

We see a trend where red wines with better quality have a smaller median density.

We see a trend where red wines with better quality have a smaller median pH.

Red wines with better quality have a larger median sulphates.

Red wines with better quality have a larger median alcohol. If alcohol is smaller than 10 % by volume, the quality of red wines is mostly 6 or worse.

So far I found the quality of red wines has a relatively clear relationship with volatile acidity, citric acid, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol. But here I also want to look at the relationships between selected pairs of these variables to check whether any of them carry essentially the same information, in which case we should not use them simultaneously.

The following seven plots are for the pairs including volatile acidity.

It is clear that volatile acidity negatively correlates with citric acid.

Next, the pairs including citric acid.

Citric acid negatively correlates with pH.

Next, the pairs including free sulfur dioxide.

Free sulfur dioxide positively correlates with total sulfur dioxide.

Next, the pairs including total sulfur dioxide.

Unable to find any clear correlations.

Next, the pairs including density.

Density and alcohol are negatively correlated.

Next, the pairs including pH.

Neither of the two shows a clear correlation.

The last one is sulphates vs. alcohol, which does not show a clear relationship either.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Red wines with better quality tend to have a smaller volatile acidity, larger citric acid, smaller density, smaller pH, larger sulphates, and larger alcohol.
If volatile acidity is larger than 0.8 g/dm^3, the quality of red wines can hardly be 7 or better.
If alcohol is smaller than 10 % by volume, the quality of red wines is mostly 6 or worse.
Quality of red wines has a nonlinear relationship with free sulfur dioxide and total sulfur dioxide.
There are also weak trends where red wines with better quality tend to have a larger fixed acidity and a smaller chlorides.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Volatile acidity negatively correlates with citric acid.
Citric acid negatively correlates with pH.
Free sulfur dioxide positively correlates with total sulfur dioxide.
Density and alcohol are negatively correlated.

What was the strongest relationship you found?

The quality of red wine has the strongest positive correlation with alcohol (correlation coefficient is 0.48). The negative correlation between quality and volatile acidity is also strong (correlation coefficient is -0.39).

Multivariate Plots Section

Here I will see how the quality of red wines is distributed on a scatter plot defined by two of the features. I use volatile acidity, citric acid, sulphates, and total sulfur dioxide, in addition to alcohol, as the main features for this investigation, since the other features either correlate with at least one of these five or have been found to have little to do with quality (like fixed acidity or chlorides).
I included citric acid in the main features even though it correlates with volatile acidity. This is because, while citric acid also correlates with pH, the correlation between volatile acidity and pH was not clear enough, which made me doubt that citric acid is largely dependent on volatile acidity.

We see red wines with better quality are distributed in the region with small volatile acidity and large alcohol, while worse ones are in the region with large volatile acidity and small alcohol. This suggests that we could build a classification model that separates red wines by quality.

In the second plot I removed the data where citric acid is zero as outliers. We see better wines have larger alcohol and citric acid.

Red wines with better quality have larger alcohol and sulphates.

This plot is a little complicated. In general better wines have larger alcohol. But at the same time, many extreme values (best and worst wines) have small total sulfur dioxide. This is consistent with the nonlinear relationship between quality and total sulfur dioxide discussed in the Bivariate Plots Section. I won’t go further into this point since it’s not a clear trend.

All the plots above taken into account, I think alcohol and volatile acidity are the best features to predict the quality of red wines. We could apply some classification methods on the alcohol vs. volatile acidity scatter plot to categorize red wines into different quality.

Here I also try a linear model using alcohol, volatile acidity, citric acid, and sulphates to predict the quality of red wines.

# Examine a series of nested linear models to predict quality
library(memisc)  # provides mtable() for the side-by-side comparison
m1 <- lm(formula=quality ~ alcohol, data=rdw)
m2 <- lm(formula=quality ~ alcohol + volatile.acidity, data=rdw)
m3 <- lm(formula=quality ~ alcohol + volatile.acidity + citric.acid, data=rdw)
m4 <- lm(formula=quality ~ alcohol + volatile.acidity + citric.acid + 
           sulphates, data=rdw)
# Compare coefficients and fit statistics across the four models
mtable(m1, m2, m3, m4)
## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = rdw)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = rdw)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + citric.acid, 
##     data = rdw)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + citric.acid + 
##     sulphates, data = rdw)
## 
## =========================================================
##                      m1        m2        m3        m4    
## ---------------------------------------------------------
## (Intercept)        1.875***  3.095***  3.055***  2.646***
##                   (0.175)   (0.184)   (0.194)   (0.201)  
## alcohol            0.361***  0.314***  0.314***  0.309***
##                   (0.017)   (0.016)   (0.016)   (0.016)  
## volatile.acidity            -1.384*** -1.343*** -1.265***
##                             (0.095)   (0.114)   (0.113)  
## citric.acid                            0.068    -0.079   
##                                       (0.103)   (0.104)  
## sulphates                                        0.696***
##                                                 (0.103)  
## ---------------------------------------------------------
## R-squared             0.227     0.317     0.317     0.336
## adj. R-squared        0.226     0.316     0.316     0.334
## sigma                 0.710     0.668     0.668     0.659
## F                   468.267   370.379   246.976   201.777
## p                     0.000     0.000     0.000     0.000
## Log-likelihood    -1721.057 -1621.814 -1621.596 -1599.093
## Deviance            805.870   711.796   711.603   691.852
## AIC                3448.114  3251.628  3253.192  3210.186
## BIC                3464.245  3273.136  3280.078  3242.448
## N                  1599      1599      1599      1599    
## =========================================================

The variables in this linear model account for 33.6% of the variance in the quality of red wines.
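As a consistency check on the table, the R-squared of the single-predictor model m1 should equal the squared correlation between quality and alcohol, and the adjusted value follows from the usual formula 1 - (1 - R^2)(n - 1)/(n - p - 1). The sketch below uses only numbers already reported above (r = 0.47616632, n = 1599); the variable names are illustrative.

```r
r_quality_alcohol <- 0.47616632  # from the correlation matrix
n <- 1599                        # number of observations
p <- 1                           # predictors in m1 (alcohol only)

# For a simple linear regression, R-squared is the squared correlation.
r_squared <- r_quality_alcohol^2

# Adjusted R-squared penalises the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

print(round(c(r_squared = r_squared, adj_r_squared = adj_r_squared), 3))
```

Both values agree with the m1 column of the table (0.227 and 0.226).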

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

A scatter plot of alcohol vs. volatile acidity by quality suggests that we can build a good classification model that categorizes red wines by quality. I wasn’t able to create an actual model, but one simple example is as follows:
if volatile acidity >= 0.8 g/dm^3: quality is 3 or 4
else if alcohol <= 10 % by volume: quality is 5
else if alcohol >= 12 % by volume or volatile acidity <= 0.4 g/dm^3: quality is 7 or 8
else: quality is 5 or 6
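The rule above can be written as a small R function. This is only a direct transcription of the four branches, not a fitted or validated model; the function name and the band labels are hypothetical.

```r
# Hand-written rule from the text; thresholds are in g/dm^3 for volatile
# acidity and % by volume for alcohol. Returns a quality band as a label.
predict_quality_band <- function(volatile_acidity, alcohol) {
  if (volatile_acidity >= 0.8) {
    "3-4"
  } else if (alcohol <= 10) {
    "5"
  } else if (alcohol >= 12 || volatile_acidity <= 0.4) {
    "7-8"
  } else {
    "5-6"
  }
}

# Two of the first ten wines from the str() output.
print(predict_quality_band(0.88, 9.8))   # wine 2
print(predict_quality_band(0.5, 10.5))   # wine 10
```

Because the branches are checked in order, a wine with low volatile acidity but alcohol at or below 10 % is still assigned "5"; reordering the branches would change the predictions.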

Similar models could be built also for the alcohol vs. citric acid plot or the alcohol vs. sulphates plot.

Were there any interesting or surprising interactions between features?

If alcohol is equal to or less than 10 % by volume, the quality of red wines is mostly 6 or worse, regardless of the other features. This is consistent with what I observed from the boxplot in the Bivariate Plots Section.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes. I created a linear model that predicts quality from alcohol, volatile acidity, citric acid, and sulphates.

The variables in the linear model account for 33.6% of the variance in the red wine quality. The addition of the citric acid to the model did not improve the R^2 value perhaps due to the variable’s (negative) correlation with the volatile acidity.
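The AIC values in the model table point the same way. As a rough sketch, AIC can be recomputed from the reported log-likelihoods as 2k - 2 logLik, where k counts the regression coefficients (including the intercept) plus one for the residual variance. The numbers are taken from the table; the variable names are illustrative.

```r
# Log-likelihoods reported in the model table for m2 and m3.
log_lik <- c(m2 = -1621.814, m3 = -1621.596)

# Regression coefficients per model, including the intercept.
n_coef <- c(m2 = 3, m3 = 4)

# One extra parameter for the residual variance.
k <- n_coef + 1

aic <- 2 * k - 2 * log_lik
print(aic)
```

Adding citric acid (m3) barely raises the log-likelihood, so the penalty for the extra parameter makes the AIC slightly worse than for m2.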


Final Plots and Summary

Plot One

Description One

Quality can be regarded as an ordered variable ranging from 3 to 8. The asymmetric distribution of red wine quality (there are more red wines with quality = 7 than those with quality = 4) is perhaps due to the right-skewed distributions of the main features that contribute most to the quality, like volatile acidity or alcohol.

Plot Two

Description Two

Red wines with better quality have a larger median alcohol. If alcohol is smaller than 10 % by volume, the quality of red wines is mostly 6 or worse.

Plot Three

Description Three

We see red wines with better quality are distributed in the region with small volatile acidity and large alcohol, while worse ones are in the region with large volatile acidity and small alcohol. This suggests that we could build a classification model that separates red wines by quality.


Reflection

The data set I used contains 1599 red wines with 11 variables on the chemical properties of the wine. I explored the quality of red wines across different variables and found that alcohol and volatile acidity, as well as citric acid and sulphates, are the main features that contribute to the quality. The other features either correlated with at least one of the main features or did not have much impact on the quality. I struggled to build a model to actually predict the quality of red wines using these features because the output, i.e. quality, is not a continuous but a discrete ordered variable. Even though I tried to build a linear model, it only explained 34% of the variance in red wine quality. Then I proposed applying some classification techniques to scatter plots like alcohol vs. volatile acidity to categorize red wine data into different quality ranges.
One suggestion for future investigation: so far even a good classification method won’t be able to clearly predict the quality of red wines in the very good quality range (such as quality = 7 and quality = 8) or in the very poor quality range (such as quality = 4 and quality = 3). This is partly due to a lack of red wine data in those quality ranges, so I would consider increasing the number of very good and very poor red wine samples. Furthermore, it is critical but quite hard to select the right features to predict the quality of red wines since we don’t know exactly how people actually sense and feel the taste of food and drinks. I’m interested in how state-of-the-art machine learning techniques like deep neural networks, incorporating many more variables related to red wines, could help improve the accuracy of quality prediction.